data category
Understanding the Influence of Synthetic Data for Text Embedders
Springer, Jacob Mitchell, Adlakha, Vaibhav, Reddy, Siva, Raghunathan, Aditi, Mosbach, Marius
Recent progress in developing general-purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between performance on different task categories: data that benefits one task can degrade performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Dominican Republic (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.47)
From On-chain to Macro: Assessing the Importance of Data Source Diversity in Cryptocurrency Market Forecasting
Demosthenous, Giorgos, Georgiou, Chryssis, Polydorou, Eliada
This study investigates the impact of data source diversity on the performance of cryptocurrency forecasting models by integrating various data categories, including technical indicators, on-chain metrics, sentiment and interest metrics, traditional market indices, and macroeconomic indicators. We introduce the Crypto100 index, representing the top 100 cryptocurrencies by market capitalization, and propose a novel feature reduction algorithm to identify the most impactful and resilient features from diverse data sources. Our comprehensive experiments demonstrate that data source diversity significantly enhances the predictive performance of forecasting models across different time horizons. Key findings include the paramount importance of on-chain metrics for both short-term and long-term predictions, the growing relevance of traditional market indices and macroeconomic indicators for longer-term forecasts, and substantial improvements in model accuracy when diverse data sources are utilized. These insights help demystify the short-term and long-term driving factors of the cryptocurrency market and lay the groundwork for developing more accurate and resilient forecasting models.
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Eshuijs, Leon, Wang, Shihan, Fokkens, Antske
Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- (3 more...)
- Media > Film (0.49)
- Leisure & Entertainment (0.49)
Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning
Qiu, Wenjun, Lie, David, Austin, Lisa
A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning, and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator disagreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving class and data category balance in the data set. The combination of these techniques allows Calpric to produce models that are accurate over a wider range of data categories and provide more detailed, fine-grain labels than previous work. Our crowdsourcing process enables Calpric to attain reliable labeled data at a cost of roughly $0.92-$1.71 per labeled text segment. Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 data categories with balanced positive and negative samples.
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- (11 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.93)
Conversational Financial Information Retrieval Model (ConFIRM)
Choi, Stephen, Gazeley, William, Wong, Siu Ho, Li, Tingting
With the exponential growth in large language models (LLMs), leveraging their emergent properties for specialized domains like finance merits exploration. However, regulated fields such as finance pose unique constraints, requiring domain-optimized frameworks. We present ConFIRM, an LLM-based conversational financial information retrieval model tailored for query intent classification and knowledge base labeling. ConFIRM comprises two modules: 1) a method to synthesize finance domain-specific question-answer pairs, and 2) an evaluation of parameter-efficient fine-tuning approaches for the query classification task. We generate a dataset of over 4,000 samples, assessing accuracy on a separate test set. ConFIRM achieved over 90% accuracy, essential for regulatory compliance. ConFIRM provides a data-efficient solution to extract precise query intent for financial dialog systems.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Spain (0.04)
- Asia > China > Hong Kong (0.04)
- Law (1.00)
- Government (1.00)
- Banking & Finance > Economy (0.94)
- Banking & Finance > Trading (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Data-centric Operational Design Domain Characterization for Machine Learning-based Aeronautical Products
Kaakai, Fateh, Adibhatla, Shridhar "Shreeder", Pai, Ganesh, Escorihuela, Emmanuelle
We give a first rigorous characterization of Operational Design Domains (ODDs) for Machine Learning (ML)-based aeronautical products. Unlike in other application sectors (such as self-driving road vehicles) where ODD development is scenario-based, our approach is data-centric: we propose the dimensions along which the parameters that define an ODD can be explicitly captured, together with a categorization of the data that ML-based applications can encounter in operation, whilst identifying their system-level relevance and impact. Specifically, we discuss how those data categories are useful to determine: the requirements necessary to drive the design of ML Models (MLMs); the potential effects on MLMs and higher levels of the system hierarchy; the learning assurance processes that may be needed, and system architectural considerations. We illustrate the underlying concepts with an example of an aircraft flight envelope.
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
- Transportation > Air (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
- Aerospace & Defense > Aircraft (0.69)
- Automobiles & Trucks (0.67)
Firmware implementation of a recurrent neural network for the computation of the energy deposited in the liquid argon calorimeter of the ATLAS experiment
Aad, Georges, Calvet, Thomas, Chiedde, Nemer, Faure, Robert, Fortin, Etienne Marie, Laatu, Lauri, Monnier, Emmanuel, Sur, Nairit
The ATLAS experiment measures the properties of particles that are products of proton-proton collisions at the LHC. The ATLAS detector will undergo a major upgrade before the high luminosity phase of the LHC. The ATLAS liquid argon calorimeter measures the energy of particles interacting electromagnetically in the detector. The readout electronics of this calorimeter will be replaced during the aforementioned ATLAS upgrade. The new electronic boards will be based on state-of-the-art field-programmable gate arrays (FPGA) from Intel allowing the implementation of neural networks embedded in firmware. Neural networks have been shown to outperform the current optimal filtering algorithms used to compute the energy deposited in the calorimeter. This article presents the implementation of a recurrent neural network (RNN) allowing the reconstruction of the energy deposited in the calorimeter on Stratix 10 FPGAs. The implementation in high level synthesis (HLS) language allowed fast prototyping but fell short of meeting the stringent requirements in terms of resource usage and latency. Further optimisations in Very High-Speed Integrated Circuit Hardware Description Language (VHDL) allowed fulfilment of the requirements of processing 384 channels per FPGA with a latency smaller than 125 ns.
Phenotype Detection in Real World Data via Online MixEHR Algorithm
Xu, Ying, Gauriau, Romane, Decker, Anna, Oppenheim, Jacob
Understanding patterns of diagnoses, medications, procedures, and laboratory tests from electronic health records (EHRs) and health insurer claims is important for understanding disease risk and for efficient clinical development; this often requires rules-based curation in collaboration with clinicians. We extended an unsupervised phenotyping algorithm, mixEHR, to an online version, allowing us to use it on order-of-magnitude larger datasets, including a large, US-based claims dataset and a rich regional EHR dataset. In addition to recapitulating previously observed disease groups, we discovered clinically meaningful disease subtypes and comorbidities. This work scaled up an effective unsupervised learning method, reinforced existing clinical knowledge, and offers a promising approach for efficient collaboration with clinicians.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
- Asia > Middle East > Jordan (0.04)
Towards Infield Navigation: leveraging simulated data for crop row detection
de Silva, Rajitha, Cielniak, Grzegorz, Gao, Junfeng
Agricultural datasets for crop row detection are often bound by their limited number of images. This restricts researchers from developing deep learning based models for precision agricultural tasks involving crop row detection. We suggest the utilization of small real-world datasets along with additional data generated by simulations to yield crop row detection performance similar to that of a model trained with a large real-world dataset. Our method could reach the performance of a deep learning based crop row detection model trained with real-world data while using 60% less labelled real-world data. Our model performed well against field variations such as shadows, sunlight and growth stages. We introduce an automated pipeline to generate labelled images for crop row detection in the simulation domain. An extensive comparison is done to analyze the contribution of simulated data towards reaching robust crop row detection in various real-world field scenarios.
- Europe > United Kingdom > England > Lincolnshire > Lincoln (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
The 6-Ds of Creating AI-Enabled Systems
We are entering our tenth year of the current Artificial Intelligence (AI) spring, and, as with previous AI hype cycles, the threat of an AI winter looms. AI winters occurred because of ineffective approaches towards navigating the technology valley of death. The 6-D framework provides an end-to-end approach to successfully navigating this challenge. It starts with problem decomposition to identify potential AI solutions, and ends with considerations for deployment of AI-enabled systems. Each component of the 6-D framework and a precision medicine use case are described in this paper.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Maryland > Prince George's County > Laurel (0.04)
- North America > United States > Florida > Orange County > Orlando (0.04)